Generalised Brown Clustering and Roll-Up Feature Generation

نویسندگان

  • Leon Derczynski
  • Sean Chester
چکیده

Brown clustering is an established technique, used in hundreds of computational linguistics papers each year, to group word types that have similar distributional information. It is unsupervised and can be used to create powerful word representations for machine learning. Despite its improbable success relative to more complex methods, few have investigated whether Brown clustering has really been applied optimally. In this paper, we present a subtle but profound generalisation of Brown clustering to improve the overall quality by decoupling the number of output classes from the computational active set size. Moreover, the generalisation permits a novel approach to feature selection from Brown clusters: We show that the standard approach of shearing the Brown clustering output tree at arbitrary bitlengths is lossy and that features should be chosen insead by rolling up Generalised Brown hierarchies. The generalisation and corresponding feature generation is more principled, challenging the way Brown clustering is currently understood and applied.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improving Quality of Hierarchical Clustering for Large Data Series

Brown clustering is a hard, hierarchical, bottom-up clustering of words in a vocabulary. Words are assigned to clusters based on their usage pattern in a given corpus. The resulting clusters and hierarchical structure can be used in constructing class-based language models and for generating features to be used in natural language processing (NLP) tasks. Because of its high computational cost, ...

متن کامل

Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

In this paper, principles and existing feature selection methods for classifying and clustering data be introduced. To that end, categorizing frameworks for finding selected subsets, namely, search-based and non-search based procedures as well as evaluation criteria and data mining tasks are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...

متن کامل

Optimal Feature Selection for Data Classification and Clustering: Techniques and Guidelines

In this paper, principles and existing feature selection methods for classifying and clustering data be introduced. To that end, categorizing frameworks for finding selected subsets, namely, search-based and non-search based procedures as well as evaluation criteria and data mining tasks are discussed. In the following, a platform is developed as an intermediate step toward developing an intell...

متن کامل

An automatic rule base generation method for fuzzy pattern recognition with multiphased clustering

This paper presents a new approach for the automatic generation of fuzzy rule bases for pattern recognition. The general idea of the approach is to use and enhance the fuzzy c-means clustering algorithm. The rule base is generated through an iterative feature clustering approach. The automatic extraction of features is repeated until the generated rule base is giving an unequivocal answer. Alth...

متن کامل

Computational evaluation of the homogeneity of composites processed by accumulative roll bonding (ARB)

A new computational method based on MATLAB was used to study the effect of different parameters on the homogeneity of composites produced by a severe plastic deformation technique known as accumulative roll bonding. For a higher number of passes, the degree of particle agglomeration and clustering decreased, and an appreciable homogeneity was obtained in both longitudinal and transverse directi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016